Adding On-demand Training Data Notebook #162


Closed

Conversation

KennSmithDS:

PR to merge the notebook tutorial for creating on-demand training data from the Planetary Computer data catalog when starting from a Radiant MLHub dataset.

KennSmithDS changed the title from "Adding On-demand Training Data Notebooke" to "Adding On-demand Training Data Notebook" on May 3, 2022

TomAugspurger left a comment:

Thanks, I left a few comments inline. Still making my way through the example.

One general comment: I'm pretty uncomfortable having moderately complex code in this example. I'd much prefer that things like temporal_buffer, mind_cloud_cover_scene, and even get_landsat_8_match be generalized and put into a dedicated library, where they can be properly unit tested. With the code here in a notebook, it's not easy to test and not easy to reuse.

"source": [
"Once you have your API key, you will need to create a default profile by setting up a .mlhub/profiles file in your home directory. You can use the `mlhub configure` command line tool to do this:\n",
"\n",
"`$ mlhub configure`<br>\n",

TomAugspurger:

!mlhub configure --api-key={MLHUB_API_KEY}

as a regular code cell should work.
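
For reference, a minimal sketch of what that regular code cell could look like, assuming the API key is kept in an environment variable (the variable name is illustrative, not something the notebook defines):

import os

# Assumption: the MLHub API key has been exported as an environment variable.
MLHUB_API_KEY = os.environ["MLHUB_API_KEY"]

# Then, in a regular notebook cell (IPython shell escape):
# !mlhub configure --api-key={MLHUB_API_KEY}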

KennSmithDS (Author):

Done

"id": "2ba7d7b2-7fed-4412-8c9b-448baad6e595",
"metadata": {},
"source": [
"This helper function below encapsulates the process of querying a STAC API endpoint to fetch an ItemCollection matching query criteria."

TomAugspurger:

IMO this helper function isn't adding much value over just using catalog.search directly. I'd rather teach users how to use catalog.search.

Can you remove the uses of search_stac_api and use client_catalog.search instead?
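
For example, a hedged sketch of searching the Planetary Computer STAC API directly with pystac-client; the collection id, bounds, date range, and cloud-cover threshold are illustrative rather than taken from the notebook:

from pystac_client import Client

# Open the public Planetary Computer STAC API.
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Illustrative chip bounds and date range for a single label.
bbox = [13.30, 52.45, 13.35, 52.50]
date_range = "2018-05-01/2018-08-31"

search = catalog.search(
    collections=["landsat-8-c2-l2"],        # illustrative collection id
    bbox=bbox,
    datetime=date_range,
    query={"eo:cloud_cover": {"lt": 20}},   # keep mostly clear scenes
)

# item_collection() on recent pystac-client; older releases call it get_all_items().
items = search.item_collection()
print(f"Found {len(items)} matching Items")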

KennSmithDS (Author):

Done

"id": "2e131160-50bb-487f-8fcb-96d37ce80167",
"metadata": {},
"source": [
"We could certainly use the method above to query label Items directly from our connection to the Radiant MLHub API endpoint. However, on very large collections, such as in the case with BigEarthNet, pagination becomes a bottleneck issue in obtaining and resolving STAC items, as it only returns 100 items at a time. Querying the entire Collection of nearly ~600,000 Items could take hours.\n",

TomAugspurger:

FWIW, the limit argument in pystac-client controls the size of the pages. But agreed that fetching 600,000 items through an API isn't what we should recommend.
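
A small sketch of those knobs, assuming catalog is an open pystac_client.Client as above and the collection id is illustrative: limit sets the page size per request, and max_items caps the total number fetched.

search = catalog.search(
    collections=["bigearthnet-v1-labels"],  # illustrative collection id
    limit=500,       # page size per API request
    max_items=2000,  # stop paging after this many Items in total
)
label_items = search.item_collection()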

"if not os.path.exists(label_collection_path):\n",
" collection = Collection.fetch(BIGEARTHNET_LABEL_COLLECTION)\n",
" archive_path = collection.download(TMP_DIR)\n",
" !tar -xf {archive_path.as_posix()} -C {TMP_DIR}\n",

TomAugspurger:

It's unfortunate that this decompression takes so long :/ Any thoughts on if you can operate directly on the compressed .gz file? Probably not.

KennSmithDS (Author):

I agree, it is unfortunate how long the decompression takes.

I didn't have a great workaround for this; it's related to your previous comment about using limit and taking a random sample of Items from the larger dataset of ~600,000 Items.

I suppose if we fetch only a few thousand to begin with, pagination shouldn't be an issue and we wouldn't need to deal with the .tar.gz file. However, that likely won't be a random sample; I'm guessing the API will just return the first XXX Items?
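
One hedged option for working on the compressed archive directly, rather than extracting it, is to stream members with Python's tarfile module and keep a reservoir-style random sample; a sketch, where the stac.json filename pattern and the sample size are assumptions and archive_path is the path returned by collection.download above:

import json
import random
import tarfile

# Stream label Item JSON straight out of the .tar.gz and keep a random
# sample of k Items without extracting the archive to disk.
sample, k = [], 1000
with tarfile.open(archive_path, mode="r:gz") as tar:
    members = (m for m in tar if m.name.endswith("stac.json"))
    for i, member in enumerate(members):
        fileobj = tar.extractfile(member)
        if fileobj is None:
            continue
        item_dict = json.loads(fileobj.read())
        if len(sample) < k:
            sample.append(item_dict)                 # fill the reservoir
        elif random.random() < k / (i + 1):
            sample[random.randrange(k)] = item_dict  # replace a slot at random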

"id": "e419d96f-f8b9-4911-b5d1-4a4077767062",
"metadata": {},
"source": [
"If we had the source collection archive downloaded and uncompressed in the same parent directory as the labels collection, we could reference the source Items and images directly. However the BigEarthNet source collection is over 60GB when compressed. Therefore to work around the disk size limitations of a Planetary Computer instance, we can query the same source items from the MLHub API endpoint, the same way we got the labels, but filter to the exact source item using IDs."

TomAugspurger:

What's your preferred workflow as a user, and from MLHub's point of view? Do you want people making local copies of this dataset, or do you want them fetching from your storage on demand?

IMO, the ideal workflow is on-demand fetching from blob storage in the same region + caching, but I'm curious what you think.

KennSmithDS (Author):

In my short time here at Radiant, I've seen a divergence between the workflow we'd prefer our users to follow and the workflow most of our data users/consumers actually follow.

Personally, I agree that the ideal workflow is on-demand fetching of assets from blob storage, traversing a catalog/STAC API on a VM or notebook server in the same region. However, my assumption is that a majority of our users download the datasets directly to their local computers so they can do machine learning or other geospatial analytics on them, or aren't familiar enough with STAC/PySTAC and STAC APIs to fetch the data that way.

Also, any on-demand hosted notebook server environment comes with the issue of limited persistent volume sizes, especially if folks are accustomed to downloading instead of fetching and caching without writing to disk.

},
"outputs": [],
"source": [
"if best_l8_match:\n",

TomAugspurger:

Same question about potentially not having a match.

KennSmithDS (Author):

See the `if source_items` comment thread above.

},
"outputs": [],
"source": [
"explore_search_extent(ItemCollection([best_l8_match]))"

TomAugspurger:

Can you also plot the s2 chip's bounds here?

KennSmithDS (Author):

Similar to other comment threads, would it be best to strip this out of the helper function, and just have a few cells of repeated code for exploring the API search results?

KennSmithDS (Author):

Clarifying question on this: do you mean the overall bounds of the ItemCollection returned (e.g. minX, minY, maxX, maxY)? Each Item returned from the API search will have its own bounding box.
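
If the overall extent is what's wanted, one hedged way to get it is to union each Item's footprint with shapely; item_collection here is a placeholder for the search result:

from shapely.geometry import shape
from shapely.ops import unary_union

# Union of every Item footprint, then the overall (minx, miny, maxx, maxy)
# bounds for plotting alongside the chip's own bounding box.
footprint = unary_union([shape(item.geometry) for item in item_collection])
overall_bounds = footprint.bounds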

"metadata": {},
"outputs": [],
"source": [
"client = dd_client(\n",

TomAugspurger:

Just use dask.distributed.Client() (or a GatewayCluster & get_client if doing this on a distributed cluster)

KennSmithDS (Author):

For both ps_client and dd_client, I simplified by removing the import aliases, so users will directly use distributed.Client and pystac_client.Client.
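
A minimal sketch of that simplified setup; Client() with no arguments starts a LocalCluster, which matches the LocalCluster-only approach discussed further down:

import distributed
import pystac_client

# Local Dask cluster for the notebook; no gateway configuration required.
client = distributed.Client()

# Planetary Computer STAC API, opened with pystac_client.Client directly.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)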

"outputs": [],
"source": [
"# this cell will only work on PC or a machine with gateway cluster configured\n",
"# gateway = dask_gateway.Gateway()\n",

TomAugspurger:

IMO: pick whether or not you want this example to run on a cluster. If it runs without a cluster, then we can just remove this.

KennSmithDS (Author):

Removed

" Landsat 8 DataArray that has been cropped to label bbox\n",
" \"\"\"\n",
" # read label Item object\n",
" label_item = Item.from_file(\n",

TomAugspurger:

This, and the other functions that touch the local filesystem, won't work with a distributed cluster. They'd only work with a LocalCluster.

If we're mentioning possibly using a distributed cluster, then you'd need to restructure this. Most likely, you'd need to store everything in Azure Blob Storage and use it as a kind of shared file system that each worker can read from and write to.

KennSmithDS (Author):

For the purposes of this tutorial, it seems to make sense to focus on the workflow of gathering and processing the Landsat 8 data from the Planetary Computer with a LocalCluster, rather than complicating it by running the workflow on a distributed cluster.

KennSmithDS (Author):

@TomAugspurger do you know why the stackstac.stack function adds a buffer to the ndarrays returned? Does it have to do with how it reprojects the image data when it's cached?

For example, in this block of code I'm fetching the Sentinel-2 source imagery from the Azure Blob Storage for our MLHub. We know the chips are all 120x120 pixels, but the stack object dimensions vary from 122x122 up to 130x130.

s2_stack = stack(
    items=ItemCollection([source_item]),
    assets=BIGEARTHNET_RGB_BANDS,
    epsg=rio.open(get_redirect_url(source_item.assets["B02"])).crs.to_epsg(),
    resolution=10,
)

P.S. sorry I don't know how you're doing the cool Jupyter Notebook integration.
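
For what it's worth, a hedged sketch of pinning the output extent explicitly; whether this actually removes the padding would need confirming, and bounds/snap_bounds are standard stackstac.stack parameters rather than anything specific to this notebook:

import rasterio as rio
from pystac import ItemCollection
from stackstac import stack

# Read the chip's native bounds and CRS from one asset (get_redirect_url,
# source_item, and BIGEARTHNET_RGB_BANDS come from the notebook).
with rio.open(get_redirect_url(source_item.assets["B02"])) as src:
    chip_bounds = tuple(src.bounds)
    chip_epsg = src.crs.to_epsg()

s2_stack = stack(
    items=ItemCollection([source_item]),
    assets=BIGEARTHNET_RGB_BANDS,
    epsg=chip_epsg,
    resolution=10,
    bounds=chip_bounds,   # clip to exactly the chip's extent
    snap_bounds=False,    # don't snap bounds outward to the resolution grid
)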

KennSmithDS (Author):

Closing in favor of #171 due to a rebasing issue.
